Systems and methods for improving recognition results via user-augmentation of a database

ABSTRACT

A system improves recognition results. The system receives multimedia data and recognizes the multimedia data based on training data to generate documents. The system receives user augmentation relating to one of the documents or new documents from a user. The system supplements the training data with the user augmentation or new documents and retrains based on the supplemented training data.

RELATED APPLICATION

[0001] This application claims priority under 35 U.S.C. § 119 based on U.S. Provisional Application Nos. 60/394,064 and 60/394,082, filed Jul. 3, 2002, and Provisional Application No. 60/419,214, filed Oct. 17, 2002, the disclosures of which are incorporated herein by reference.

[0002] This application is related to U.S. patent application Ser. No. 10/______ (Docket No. 02-4042), entitled, “Continuous Learning for Speech Recognition Systems,” filed concurrently herewith, the disclosure of which is incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0003] 1. Field of the Invention

[0004] The present invention relates generally to multimedia environments and, more particularly, to systems and methods for improving recognition results of a multimedia recognition system via user-augmentation of a linguistic database.

[0005] 2. Description of Related Art

[0006] Current multimedia recognition systems obtain multimedia documents from a fixed set of sources. These documents include audio documents (e.g., radio broadcasts), video documents (e.g., television broadcasts), and text documents (e.g., word processing documents). A typical recognition system processes the documents and stores them in a database. In the case of audio or video documents, the recognition system might transcribe the documents to identify information, such as the words spoken, the identity of one or more speakers, one or more topics relating to the documents, and, in the case of video, the identity of one or more entities (persons, places, objects, etc.) appearing in the video.

[0007] When a user later desires to access the documents, the user usually queries or searches the database. For example, the user might use a standard database interface to submit a query relating to documents of interest. The database would then process the query to retrieve documents that are relevant to the query and present the documents (or a list of the documents) to the user. The documents provided to the user, however, are usually only as good as the recognition system that created them.

[0008] It has been found that the recognition results of a multimedia recognition system typically degrade over time as new words are introduced into the system. Oftentimes, the recognition system cannot accurately recognize the new words.

[0009] Accordingly, it is desirable to improve recognition results of a multimedia recognition system.

SUMMARY OF THE INVENTION

[0010] Systems and methods consistent with the present invention permit users to augment a database of a multimedia recognition system by annotating, attaching, inserting, correcting, and/or enhancing documents. The systems and methods use this user-augmentation to improve the recognition results of the recognition system.

[0011] In one aspect consistent with the principles of the invention, a system improves recognition results. The system receives multimedia data and recognizes the multimedia data based on training data to generate documents. The system receives user augmentation relating to one of the documents. The system supplements the training data with the user augmentation and retrains based on the supplemented training data.

[0012] In another aspect consistent with the principles of the invention, a multimedia recognition system receives different types of multimedia data and recognizes the multimedia data based on training data to generate recognition results. The system obtains new documents from one or more users and adds the new documents to the training data to obtain new training data. The system retrains based on the new training data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the invention and, together with the description, explain the invention. In the drawings,

[0014] FIG. 1 is a diagram of a system in which systems and methods consistent with the present invention may be implemented;

[0015] FIG. 2 is an exemplary diagram of the audio indexer of FIG. 1 according to an implementation consistent with the principles of the invention;

[0016] FIG. 3 is an exemplary diagram of the recognition system of FIG. 2 according to an implementation consistent with the present invention;

[0017] FIG. 4 is an exemplary diagram of the memory system of FIG. 1 according to an implementation consistent with the principles of the invention;

[0018] FIG. 5 is a flowchart of exemplary processing for correcting and/or enhancing documents according to an implementation consistent with the principles of the invention;

[0019] FIG. 6 is a diagram of an exemplary graphical user interface that facilitates correction and/or enhancement of a document according to an implementation consistent with the principles of the invention;

[0020] FIG. 7 is a flowchart of exemplary processing for annotating documents with bookmarks, highlights, and notes according to an implementation consistent with the principles of the invention;

[0021] FIG. 8 is a diagram of an exemplary graphical user interface that displays an annotated document according to an implementation consistent with the principles of the invention;

[0022] FIG. 9 is a flowchart of exemplary processing for attaching documents according to an implementation consistent with the principles of the invention;

[0023] FIG. 10 is a diagram of an exemplary graphical user interface that facilitates attachment of a document according to an implementation consistent with the principles of the invention; and

[0024] FIG. 11 is a flowchart of exemplary processing for adding new documents according to an implementation consistent with the principles of the invention.

DETAILED DESCRIPTION

[0025] The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims and equivalents.

[0026] Systems and methods consistent with the present invention permit users to augment a database of a multimedia recognition system by, for example, annotating, attaching, inserting, correcting, and/or enhancing documents. The systems and methods may use this user-augmentation to improve the recognition results of the recognition system. For example, the user-augmentation may be used to improve the documents stored in the database. The user-augmentation may also be used for system retraining.

EXEMPLARY SYSTEM

[0027] FIG. 1 is a diagram of an exemplary system 100 in which systems and methods consistent with the present invention may be implemented. System 100 may include multimedia sources 110, indexers 120, memory system 130, and server 140 connected to clients 150 via network 160. Network 160 may include any type of network, such as a local area network (LAN), a wide area network (WAN) (e.g., the Internet), a public telephone network (e.g., the Public Switched Telephone Network (PSTN)), a virtual private network (VPN), or a combination of networks. The various connections shown in FIG. 1 may be made via wired, wireless, and/or optical connections.

[0028] Multimedia sources 110 may include one or more audio sources 112, one or more video sources 114, and one or more text sources 116. Audio source 112 may include mechanisms for capturing any source of audio data, such as radio, telephone, and conversations, in any language, and providing the audio data, possibly as an audio stream or file, to indexers 120. Video source 114 may include mechanisms for capturing any source of video data, with possibly integrated audio data in any language, such as television, satellite, and a camcorder, and providing the video data, possibly as a video stream or file, to indexers 120. Text source 116 may include mechanisms for capturing any source of text, such as e-mail, web pages, newspapers, and word processing documents, in any language, and providing the text, possibly as a text stream or file, to indexers 120.

[0029] Indexers 120 may include one or more audio indexers 122, one or more video indexers 124, and one or more text indexers 126. Each of indexers 122, 124, and 126 may include mechanisms that receive data from multimedia sources 110, process the data, perform feature extraction, and output analyzed, marked-up, and enhanced language metadata. In one implementation consistent with the principles of the invention, indexers 122-126 include mechanisms, such as the ones described in John Makhoul et al., “Speech and Language Technologies for Audio Indexing and Retrieval,” Proceedings of the IEEE, Vol. 88, No. 8, August 2000, pp. 1338-1353, which is incorporated herein by reference.

[0030] Audio indexer 122 may receive input audio data from audio sources 112 and generate metadata therefrom. For example, indexer 122 may segment the input data by speaker, cluster audio segments from the same speaker, identify speakers by name or gender, and transcribe the spoken words. Indexer 122 may also segment the input data based on topic and locate the names of people, places, and organizations. Indexer 122 may further analyze the input data to identify when each word was spoken (possibly based on a time value). Indexer 122 may include any or all of this information in the metadata relating to the input audio data.

[0031] Video indexer 124 may receive input video data from video sources 114 and generate metadata therefrom. For example, indexer 124 may segment the input data by speaker, cluster video segments from the same speaker, identify speakers by name or gender, identify participants using face recognition, and transcribe the spoken words. Indexer 124 may also segment the input data based on topic and locate the names of people, places, and organizations. Indexer 124 may further analyze the input data to identify when each word was spoken (possibly based on a time value). Indexer 124 may include any or all of this information in the metadata relating to the input video data.

[0032] Text indexer 126 may receive input text data from text sources 116 and generate metadata therefrom. For example, indexer 126 may segment the input data based on topic and locate the names of people, places, and organizations. Indexer 126 may further analyze the input data to identify when each word occurs (possibly based on a character offset within the text). Indexer 126 may also identify the author and/or publisher of the text. Indexer 126 may include any or all of this information in the metadata relating to the input text data.

[0033] FIG. 2 is an exemplary diagram of audio indexer 122. Video indexer 124 and text indexer 126 may be similarly configured. Indexers 124 and 126 may include, however, additional and/or alternate components particular to the media type involved.

[0034] As shown in FIG. 2, indexer 122 may include training system 210, statistical model 220, and recognition system 230. Training system 210 may include logic that estimates parameters of statistical model 220 from a corpus of training data. The training data may initially include human-produced data. For example, the training data might include one hundred hours of audio data that has been meticulously and accurately transcribed by a human. Training system 210 may use the training data to generate parameters for statistical model 220 that recognition system 230 may later use to recognize future data that it receives (i.e., new audio that it has not heard before).

[0035] Statistical model 220 may include acoustic models and language models. The acoustic models may describe the time-varying evolution of feature vectors for each sound or phoneme. The acoustic models may employ continuous hidden Markov models (HMMs) to model each of the phonemes in the various phonetic contexts.

[0036] The language models may include n-gram language models, where the probability of each word is a function of the previous word (for a bi-gram language model) or the previous two words (for a tri-gram language model). Typically, the higher the order of the language model, the higher the recognition accuracy, at the cost of slower recognition speeds.
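
By way of illustration only, the following sketch shows how bi-gram probabilities of the kind described above might be estimated from transcribed training data by simple maximum-likelihood counting. The function name and the toy corpus are hypothetical and are not part of the described system.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Estimate bigram probabilities P(word | previous word) by counting."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])                  # counts of each context word
        bigrams.update(zip(padded[:-1], padded[1:]))  # counts of adjacent word pairs
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

# Toy corpus standing in for transcribed audio training data.
model = train_bigram_model([["the", "news", "tonight"],
                            ["the", "news", "today"]])
print(model[("the", "news")])      # 1.0: "news" always follows "the" here
print(model[("news", "tonight")])  # 0.5
```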

[0037] Recognition system 230 may use statistical model 220 to process input audio data. FIG. 3 is an exemplary diagram of recognition system 230 according to an implementation consistent with the principles of the invention. Recognition system 230 may include audio classification logic 310, speech recognition logic 320, speaker clustering logic 330, speaker identification logic 340, name spotting logic 350, topic classification logic 360, and story segmentation logic 370. Audio classification logic 310 may distinguish speech from silence, noise, and other audio signals in input audio data. For example, audio classification logic 310 may analyze each thirty-second window of the input data to determine whether it contains speech. Audio classification logic 310 may also identify boundaries between speakers in the input stream. Audio classification logic 310 may group speech segments from the same speaker and send the segments to speech recognition logic 320.
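
As a rough illustration of the speech/non-speech decision described above, the sketch below labels fixed-length windows of audio using a root-mean-square energy threshold. A real classifier would use trained models; the window length and threshold here are illustrative assumptions.

```python
import numpy as np

def classify_windows(samples, rate, window_s=30, energy_threshold=1e-3):
    """Label each fixed-length window of audio as speech or non-speech.

    A trained classifier would be used in practice; this sketch applies a
    simple RMS-energy threshold (all parameters are illustrative).
    """
    window = int(window_s * rate)
    labels = []
    for start in range(0, len(samples), window):
        chunk = np.asarray(samples[start:start + window], dtype=float)
        rms = np.sqrt(np.mean(np.square(chunk)))
        labels.append("speech" if rms > energy_threshold else "non-speech")
    return labels
```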

[0038] Speech recognition logic 320 may perform continuous speech recognition to recognize the words spoken in the segments that it receives from audio classification logic 310. Speech recognition logic 320 may generate a transcription of the speech using statistical model 220. Speaker clustering logic 330 may identify all of the segments from the same speaker in a single document (i.e., a body of media that is contiguous in time, from beginning to end or from time A to time B) and group them into speaker clusters. Speaker clustering logic 330 may then assign each of the speaker clusters a unique label. Speaker identification logic 340 may identify the speaker in each speaker cluster by name or gender.
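
The following is a minimal sketch of the clustering step, under the assumption (not spelled out in the text) that each segment has already been reduced to a numeric speaker vector: segments whose vectors fall close together receive the same cluster label.

```python
import numpy as np

def cluster_speakers(segment_vectors, threshold=0.5):
    """Greedily group per-segment speaker vectors into labeled clusters.

    Hypothetical stand-in for speaker clustering: a segment joins the first
    cluster whose centroid lies within `threshold`; otherwise it starts a
    new cluster labeled spk0, spk1, ...
    """
    centroids, labels = [], []
    for vec in segment_vectors:
        vec = np.asarray(vec, dtype=float)
        distances = [np.linalg.norm(vec - c) for c in centroids]
        if distances and min(distances) < threshold:
            labels.append(f"spk{int(np.argmin(distances))}")
        else:
            centroids.append(vec)
            labels.append(f"spk{len(centroids) - 1}")
    return labels
```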

[0039] Name spotting logic 350 may locate the names of people, places, and organizations in the transcription. Name spotting logic 350 may extract the names and store them in a database. Topic classification logic 360 may assign topics to the transcription. Each of the words in the transcription may contribute differently to each of the topics assigned to the transcription. Topic classification logic 360 may generate a rank-ordered list of all possible topics and corresponding scores for the transcription.
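
A simplified sketch of the rank-ordered topic scoring might look as follows, where per-word topic weights stand in for a trained topic model; the topic names and weights are invented for the example.

```python
def rank_topics(transcription_words, topic_keywords):
    """Score each candidate topic by summing per-word weights, then rank.

    `topic_keywords` maps a topic name to {word: weight}; the weights are
    hypothetical stand-ins for a trained topic model.
    """
    scores = {}
    for topic, weights in topic_keywords.items():
        scores[topic] = sum(weights.get(w.lower(), 0.0) for w in transcription_words)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

topics = {"elections": {"vote": 2.0, "ballot": 1.5}, "weather": {"storm": 2.0}}
print(rank_topics("The vote on the ballot measure".split(), topics))
# [('elections', 3.5), ('weather', 0.0)]
```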

[0040] Story segmentation logic 370 may change the continuous stream of words in the transcription into document-like units with coherent sets of topic labels and other document features generated or identified by the components of recognition system 230. This information may constitute metadata corresponding to the input audio data. Story segmentation logic 370 may output the metadata in the form of documents to memory system 130, where a document corresponds to a body of media that is contiguous in time (from beginning to end or from time A to time B).
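
Pulling these pieces together, the metadata that story segmentation logic 370 emits for one document could be represented roughly as the structure below; the field names are hypothetical, chosen only to mirror the items listed above.

```python
from dataclasses import dataclass, field

@dataclass
class RecognizedDocument:
    """Illustrative shape of the metadata emitted for one document
    (a body of media contiguous in time); field names are hypothetical."""
    start_time: float
    end_time: float
    transcription: str
    speaker_segments: list = field(default_factory=list)  # (start, end, speaker label)
    named_entities: list = field(default_factory=list)    # (text, type), e.g. ("Boston", "PLACE")
    topics: list = field(default_factory=list)            # (topic, score), rank-ordered
```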

[0041] Returning to FIG. 1, memory system 130 may store documents from indexers 120 and documents from clients 150, as will be described in more detail below. FIG. 4 is an exemplary diagram of memory system 130 according to an implementation consistent with the principles of the invention. Memory system 130 may include loader 410, trainer 420, one or more databases 430, and interface 440. Loader 410 may include logic that receives documents from indexers 120 and stores them in database 430. Trainer 420 may include logic that sends documents in the form of training data to indexers 120.

[0042] Database 430 may include a conventional database, such as a relational database, that stores documents from indexers 120. Database 430 may also store documents received from clients 150 via server 140. Interface 440 may include logic that interacts with server 140 to store documents in database 430, query or search database 430, and retrieve documents from database 430.
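
As a sketch of what storing and querying documents in database 430 might reduce to, the snippet below uses an in-memory relational database from the Python standard library; the schema and query are illustrative assumptions, not the actual design.

```python
import sqlite3

# Minimal sketch of a relational store for recognized documents; the
# schema and query are illustrative, not the patent's actual database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    source TEXT,          -- e.g. 'audio', 'video', 'text'
    transcription TEXT,
    topics TEXT)""")
conn.execute("INSERT INTO documents (source, transcription, topics) VALUES (?, ?, ?)",
             ("audio", "the senate voted on the measure", "elections"))

# Interface 440's query path reduces to a search over stored documents.
rows = conn.execute("SELECT id, source FROM documents WHERE transcription LIKE ?",
                    ("%voted%",)).fetchall()
print(rows)  # [(1, 'audio')]
```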

[0043] Returning to FIG. 1, server 140 may include a computer or another device that is capable of interacting with memory system 130 and clients 150 via network 160. Server 140 may receive queries from clients 150 and use the queries to retrieve relevant documents from memory system 130. Server 140 may also receive documents, or links to documents, from clients 150 and store the documents in memory system 130. Clients 150 may include personal computers, laptops, personal digital assistants, or other types of devices that are capable of interacting with server 140 to retrieve documents from memory system 130 and provide documents, and possibly other information, to memory system 130. Clients 150 may present information to users via a graphical user interface, such as a web browser window.

EXEMPLARY PROCESSING

[0044] Systems and methods consistent with the present invention permit users to augment memory system 130 to improve recognition results of system 100. For example, the user-augmentation may be used to improve the value of documents stored in memory system 130 and may also be used to retrain indexers 120. The user-augmentation may include: (1) correction and/or enhancement of the documents; (2) annotation of the documents with bookmarks, highlights, and notes; (3) attachment of rich documents to documents from memory system 130; and (4) insertion of rich documents into system 100. Each of these will be described in detail below.

[0045] Document Correction and/or Enhancement

[0046] FIG. 5 is a flowchart of exemplary processing for correcting and/or enhancing documents according to an implementation consistent with the principles of the invention. Processing may begin with a user desiring to retrieve one or more documents from memory system 130. The user may use a conventional web browser of client 150 to access server 140 in a conventional manner. To obtain documents of interest, the user may generate a search query and send the query to server 140 via client 150. Server 140 may use the query to search memory system 130 and retrieve relevant documents.

[0047] Server 140 may present the relevant documents to the user (act 510). For example, the user may be presented with a list of relevant documents. The documents may include any combination of audio documents, video documents, and text documents. The user may select one or more documents on the list to view. In the case of an audio or video document, the user may be presented with a transcription of the audio data or video data corresponding to the document.

[0048] FIG. 6 is a diagram of an exemplary graphical user interface (GUI) 600 that facilitates correction and/or enhancement of a document according to an implementation consistent with the principles of the invention. In one implementation, GUI 600 is part of an interface of a standard Internet browser, such as Internet Explorer or Netscape Navigator, or any browser that follows World Wide Web Consortium (W3C) specifications for HTML.

[0049] GUI 600 may include a speaker section 610, a transcription section 620, and a topics section 630. Speaker section 610 may identify boundaries between speakers, the gender of a speaker, and the name of a speaker (when known). In this way, speaker segments are clustered together over the entire document to group together segments from the same speaker under the same label. In the example of FIG. 6, one speaker, Elizabeth Vargas, has been identified by name.

[0050] Transcription section 620 may include a transcription of the document. In the example of FIG. 6, the document corresponds to video data from a television broadcast of ABC's World News Tonight. Transcription section 620 may identify the names of people, places, and organizations by visually distinguishing them in some manner. For example, people, places, and organizations may be identified using different colors. Topics section 630 may include topics relating to the transcription in transcription section 620. Each of the topics may describe the main themes of the document and may constitute a very high-level summary of the content of the transcription, even though the exact words in the topic may not be included in the transcription.

[0051] GUI 600 may also include a modify button 640. The user may select modify button 640 when the user desires to correct and/or enhance the document. Sometimes, the document is incomplete or incorrect in some manner. For example, the document may identify an unknown speaker only by gender, or may fail to visually distinguish a word that is the name of a person, place, or organization. If the user desires, the user may provide the name of an unknown speaker or identify that one of the words in the transcription is the name of a person, place, or organization by selecting modify button 640 and providing the correct information. Alternatively, the document may contain an incorrect topic or a misspelling. If the user desires, the user may correct these items by selecting modify button 640 and providing the correct information.

[0052] GUI 600 may receive the information provided by the user and modify the document onscreen. This way, the user may determine whether the information was correctly provided. GUI 600 may also send the modified (i.e., corrected/enhanced) document to server 140.

[0053] Returning to FIG. 5, server 140 may receive the modified document and send it to memory system 130 (act 520). Memory system 130 may store the modified document in database 430 (FIG. 4) (act 530). Thereafter, when any user retrieves this document from database 430, the user gets the document with the correction(s)/enhancement(s). This may aid the user in browsing the document and determining whether the document is one in which the user is interested.

[0054] Memory system 130 may also send the modified document to one or more of indexers 120 for retraining (act 540). Memory system 130 may send the modified document in the form of training data. For example, memory system 130 may put the modified document in a special form for use by indexers 120 to retrain. Alternatively, memory system 130 may send the modified document to indexers 120, along with an instruction to retrain.

[0055] Training system 210 (FIG. 2) of indexers 120 may use the modified document to retrain. For example, training system 210 may supplement its corpus of training data with the modified document and generate new parameters for statistical model 220 based on the supplemented corpus of training data.
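
In outline, this retraining step amounts to folding the corrected document into the training corpus and re-estimating the model parameters. The helper below is a hypothetical sketch; `train_fn` stands in for whatever parameter-estimation routine the indexer uses (e.g., the bi-gram trainer sketched earlier).

```python
def retrain_with_augmentation(training_corpus, augmented_documents, train_fn):
    """Supplement the training corpus with user-corrected documents and
    re-estimate model parameters (a sketch of the retraining loop above)."""
    supplemented = list(training_corpus) + list(augmented_documents)
    return train_fn(supplemented)
```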

[0056] Suppose, for example, that the user provided the name of one of the speakers who was identified simply by gender in the document. Speaker identification logic 340 (FIG. 3) may use the name and the corresponding original audio data to recognize this speaker in the future. It may take more than a predetermined amount of audio from a speaker (e.g., more than five minutes of speech) before speaker identification logic 340 can begin to automatically recognize the speech from the speaker. By retraining based on corrected and/or enhanced documents, indexers 120 improve their recognition results.
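
One plausible way to track this enrollment condition (a sketch, not the patent's mechanism) is to accumulate the duration of corrected speech per named speaker and enable automatic identification only past the threshold:

```python
class SpeakerEnrollment:
    """Track corrected speaker names and the audio accumulated for each;
    enable automatic identification only after enough speech is collected.
    The 5-minute threshold mirrors the example above; the class itself is
    a hypothetical sketch."""
    MIN_SECONDS = 5 * 60

    def __init__(self):
        self.seconds = {}

    def add_corrected_segment(self, name, duration_s):
        self.seconds[name] = self.seconds.get(name, 0.0) + duration_s

    def can_auto_identify(self, name):
        return self.seconds.get(name, 0.0) >= self.MIN_SECONDS
```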

[0057] Document Annotation

[0058] FIG. 7 is a flowchart of exemplary processing for annotating documents with bookmarks, highlights, and notes according to an implementation consistent with the principles of the invention. Processing may begin with a user desiring to retrieve one or more documents from memory system 130. The user may use a conventional web browser of client 150 to access server 140 in a conventional manner. To obtain documents of interest, the user may generate a search query and send the query to server 140 via client 150. Server 140 may use the query to search memory system 130 and retrieve relevant documents.

[0059] Server 140 may present the relevant documents to the user (act 710). For example, the user may be presented with a list of relevant documents. The documents may include any combination of audio documents, video documents, and text documents. The user may select one or more documents on the list to view the document(s). In the case of an audio or video document, the user may be presented with a transcription of the audio data or video data corresponding to the document.

[0060] If the user desires, the user may annotate a document. For example, the user may bookmark the document, highlight the document, and/or add a note to the document. FIG. 8 is a diagram of an exemplary graphical user interface (GUI) 800 that displays an annotated document according to an implementation consistent with the principles of the invention. Similar to GUI 600, GUI 800 includes a speaker section, a transcription section, and a topics section.

[0061] GUI 800 may also include an annotate button 810, a highlighted block of text 820, and a note 830. If the user desires to annotate the document, the user may select annotate button 810. The user may then be presented with a list of annotation options, such as adding a bookmark, highlight, or note. If the user desires to bookmark the document, the user may select the bookmark option. In this case, GUI 800 may add a flag to the document so that the user may later be able to easily retrieve the document from memory system 130. In some instances, the user may be able to share bookmarks with other users.

[0062] If the user desires to highlight a portion of the document, the user may select the highlight option. In this case, the user may visually highlight one or more portions of the document, such as highlighted block 820. The highlight, or the color of the highlight, may provide meaning to highlighted block 820. For example, the highlight might correspond to the user doing the highlighting, signify that highlighted block 820 is important or unimportant, or have some other significance. When other users later retrieve this document, the users may see the highlighting added by the user.

[0063] If the user desires to add a note to the document, the user may select the note option. In this case, the user may add a note 830 to the document or a portion of the document. Note 830 may include comments from the user, a multimedia file (audio, video, or text), or a reference (e.g., a link) to another document in memory system 130. When other users later retrieve this document, the users may be able to see note 830 added by the user.

[0064] GUI 800 may receive the information (bookmark, highlight, note) provided by the user and annotate the document accordingly onscreen. This way, the user may determine whether the information was correctly provided. GUI 800 may also send the annotated document to server 140.

[0065] Returning to FIG. 7, server 140 may receive the annotated document and send it to memory system 130 (act 720). Memory system 130 may store the annotated document in database 430 (FIG. 4) (act 730). Thereafter, when any user retrieves this document from database 430, the user gets the document with the annotation(s). This may aid the user in browsing the document, determining whether the document is one in which the user is interested, and retrieving other relevant documents. Alternatively, the document may be protected so that only the user who annotated the document may later see the annotations.

[0066] Memory system 130 may also send the annotated document to one or more of indexers 120 for retraining (act 740). Memory system 130 may send the annotated document in the form of training data. For example, memory system 130 may put the annotated document in a special form for use by indexers 120 to retrain. Alternatively, memory system 130 may send the annotated document to indexers 120, along with an instruction to retrain.

[0067] Training system 210 (FIG. 2) of indexers 120 may use the annotated document to retrain. For example, training system 210 may supplement its corpus of training data with the annotated document and generate new parameters for statistical model 220 based on the supplemented corpus of training data.

[0068] Suppose, for example, that the user provided comments within a note attached to a portion of the document. The comments may include discipline-specific words that indexers 120 cannot recognize or may include names of people, places, or companies that indexers 120 have not seen before. Indexers 120 may use the comments in recognizing future occurrences of the discipline-specific words or the names. By retraining based on annotated documents, indexers 120 improve their recognition results.
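
A minimal sketch of how such note text might feed the recognizer's word list, assuming a simple set-of-words vocabulary representation (the tokenization here is deliberately crude):

```python
def extend_vocabulary(vocabulary, note_text):
    """Add out-of-vocabulary words found in a user's note to the recognizer's
    word list so that later retraining can model them (illustrative sketch)."""
    new_words = {w.lower().strip(".,!?") for w in note_text.split()}
    added = new_words - vocabulary
    vocabulary |= added
    return added

vocab = {"the", "news", "tonight"}
print(sorted(extend_vocabulary(vocab, "New vaccine trial in Gaborone")))
# ['gaborone', 'in', 'new', 'trial', 'vaccine']
```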

[0069] Document Attachment

[0070] FIG. 9 is a flowchart of exemplary processing for attaching documents to documents stored in memory system 130 according to an implementation consistent with the principles of the invention. Processing may begin with a user desiring to retrieve one or more documents from memory system 130. The user may use a conventional web browser of client 150 to access server 140 in a conventional manner. To obtain documents of interest, the user may generate a search query and send the query to server 140 via client 150. Server 140 may use the query to search memory system 130 and retrieve relevant documents.

[0071] Server 140 may present the relevant documents to the user (act 910). For example, the user may be presented with a list of relevant documents. The documents may include any combination of audio documents, video documents, and text documents. The user may select one or more documents on the list to view the document(s). In the case of an audio or video document, the user may be presented with a transcription of the audio data or video data corresponding to the document.

[0072] If the user desires, the user may attach a rich document to a portion of the document (the “original document”). The rich document may include an audio, video, or text document relevant to that particular portion of the original document or to the entire original document. For example, the rich document may be relevant to a topic contained within the original document and may describe the topic in a way that the topic is not described in the original document.

[0073] FIG. 10 is a diagram of an exemplary graphical user interface (GUI) 1000 that facilitates attachment of a document according to an implementation consistent with the principles of the invention. Similar to GUI 600, GUI 1000 includes a speaker section, a transcription section, and a topics section.

[0074] GUI 1000 may also include an attach document button 1010. If the user desires to attach a rich document, the user may select attach document button 1010. The user may then be presented with a list of attachment options. For example, the user may cut-and-paste text of the rich document into a window of GUI 1000. Alternatively, the user may attach a file containing the rich document or provide a link to the rich document. This may be particularly useful if the rich document is an audio or video document. GUI 1000 may receive the attached document (i.e., the rich document) from the user and provide the attached document to server 140.

[0075] Returning to FIG. 9, server 140 may receive and parse the attached document (act 920). Server 140 may then send it to memory system 130 (act 920). Memory system 130 may store the attached document in database 430 (FIG. 4) (act 930). Thereafter, when any user retrieves the original document from database 430, the user may also get the attached document or a link to the attached document. This may aid the user in finding documents of interest. Alternatively, the attached document may be protected so that only the user who provided the attached document may later see the attached document or the link to the attached document.

[0076] Memory system 130 may also send the attached document to one or more of indexers 120 for retraining (act 940). Memory system 130 may send the attached document in the form of training data. For example, memory system 130 may put the attached document in a special form for use by indexers 120 to retrain. Alternatively, memory system 130 may send the attached document to indexers 120, along with an instruction to retrain.

[0077] Training system 210 (FIG. 2) of indexers 120 may use the attached document to retrain. For example, training system 210 may supplement its corpus of training data with the attached document and generate new parameters for statistical model 220 based on the supplemented corpus of training data.

[0078] Training system 210 may also extract certain information from the attached document. For example, training system 210 may generate likely pronunciations for unfamiliar words or determine that certain words are names of people, places, or organizations based on their context within the document. By retraining based on attached documents, indexers 120 improve their recognition results.
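
For unfamiliar words, a first-pass pronunciation can be guessed letter by letter, as in the toy sketch below. A real system would use a trained letter-to-sound model; this mapping is purely illustrative.

```python
# Naive letter-to-sound sketch: a real system would use trained
# grapheme-to-phoneme models; this mapping is purely illustrative.
LETTER_TO_PHONE = {"a": "AH", "b": "B", "c": "K", "d": "D", "e": "EH",
                   "f": "F", "g": "G", "h": "HH", "i": "IH", "j": "JH",
                   "k": "K", "l": "L", "m": "M", "n": "N", "o": "OW",
                   "p": "P", "q": "K", "r": "R", "s": "S", "t": "T",
                   "u": "UH", "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z"}

def guess_pronunciation(word):
    """Guess a phoneme sequence for an unfamiliar word, letter by letter."""
    return " ".join(LETTER_TO_PHONE[ch] for ch in word.lower() if ch in LETTER_TO_PHONE)

print(guess_pronunciation("Brubeck"))  # B R UH B EH K K  (crude, but a starting point)
```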

[0079] Optionally, memory system 130 may also send the attached document for recognition by an appropriate one of the indexers 120 (act 950). For example, if the attached document is an audio document, memory system 130 may provide the attached document to the input of audio indexer 122 for recognition. As described above, audio indexer 122 may segment the audio document by speaker, cluster audio segments from the same speaker, identify speakers by name or gender, and transcribe the spoken words. Audio indexer 122 may also segment the audio document based on topic, locate the names of people, places, and organizations, and identify when each word was spoken (possibly based on a time value). Audio indexer 122 may then store this metadata in memory system 130.

[0080] Document Insertion

[0081] FIG. 11 is a flowchart of exemplary processing for adding new documents according to an implementation consistent with the principles of the invention. Processing may begin with a user desiring to add one or more documents to memory system 130. The user may use a conventional web browser of client 150 to access server 140 in a conventional manner to provide a new document. Alternatively, server 140 may use an agent to actively seek out new documents, such as documents from a specific folder (e.g., a My Documents folder) on client 150 or documents on the Internet. In any event, the documents might include a user's personal e-mail stream, a web page, or a word processing document.

[0082] The user may provide the new document in several ways. For example, the user may cut-and-paste text of the document. Alternatively, the user may provide a file containing the document or provide a link to the document. This may be particularly useful if the document is an audio or video document.

[0083] Server 140 may receive or obtain the document (act 1110). For example, if the user provided a link to the document, then server 140 may use the link to retrieve the document using conventional techniques. Server 140 may then process the document (act 1120). For example, if the document is a web page, server 140 may parse the document and discard advertisements and other extraneous information. Server 140 may then send the document to memory system 130. Memory system 130 may store the document in database 430 (act 1130). The document may, thereafter, be available to other users.
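
The web-page cleanup step might be sketched as follows using the Python standard library's HTML parser; skipping `script` and `style` content is a stand-in for the fuller ad removal the text describes.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from a web page, skipping script/style content.
    A stand-in for server 140's parsing step; real ad removal would need
    site-specific rules."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # inside how many skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><style>p{}</style><p>Senate votes today.</p></html>")
print(" ".join(extractor.chunks))  # Senate votes today.
```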

[0084] Memory system 130 may also send the document to one or more of indexers 120 for retraining (act 1140). Memory system 130 may send the document in the form of training data. For example, memory system 130 may put the document in a special form for use by indexers 120 to retrain. Alternatively, memory system 130 may send the document to indexers 120, along with an instruction to retrain.

[0085] Training system 210 (FIG. 2) of indexers 120 may use the document to retrain. For example, training system 210 may supplement its corpus of training data with the document and generate new parameters for statistical model 220 based on the supplemented corpus of training data.

[0086] Training system 210 may also extract certain information from the document. For example, training system 210 may generate likely pronunciations for unfamiliar words or determine that certain words are names of people, places, or organizations based on their context within the document. By retraining based on new documents, indexers 120 improve their recognition results.

[0087] Optionally, memory system 130 may also send the document for recognition by an appropriate one of the indexers 120 (act 1150). For example, if the document is an audio document, memory system 130 may provide the document to the input of audio indexer 122 for recognition. As described above, audio indexer 122 may segment the audio document by speaker, cluster audio segments from the same speaker, identify speakers by name or gender, and transcribe the spoken words. Audio indexer 122 may also segment the audio document based on topic, locate the names of people, places, and organizations, and identify when each word was spoken (possibly based on a time value). Audio indexer 122 may then store this metadata in memory system 130.

CONCLUSION

[0088] Systems and methods consistent with the present invention permit users to augment a database of a multimedia recognition system by, for example, annotating, attaching, inserting, correcting, and/or enhancing documents. The systems and methods may use this user-augmentation to improve the recognition results of the recognition system. For example, the user-augmentation may be used to improve the documents stored in the database. The user-augmentation may also be used for system retraining.

[0089] The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.

[0090] For example, server 140 may elicit information from the user. Server 140 may ask the user to verify that a certain word corresponds to a person, place, or organization. Alternatively, server 140 may request that the user supply a document that relates to the word.

[0091] Also, exemplary graphical user interfaces have been described with regard to FIGS. 6, 8, and 10 as containing certain features in various implementations consistent with the principles of the invention. It is to be understood that a graphical user interface, consistent with the present invention, may include any or all of these features or different features to facilitate the user-augmentation.

[0092] While series of acts have been described with regard to FIGS. 5, 7, 9, and 11, the order of the acts may differ in other implementations consistent with the principles of the invention.

[0093] Further, certain portions of the invention have been described as “logic” that performs one or more functions. This logic may include hardware, such as an application specific integrated circuit or a field programmable gate array, software, or a combination of hardware and software.

[0094] No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. The scope of the invention is defined by the claims and their equivalents.

What is claimed is:
1. A multimedia recognition system, comprising: a plurality of indexers configured to: receive multimedia data, and analyze the multimedia data based on training data to generate a plurality of documents; and a memory system configured to: store the documents from the indexers, receive user augmentation relating to one of the documents, and provide the user augmentation to one or more of the indexers for retraining based on the user augmentation.

2. The system of claim 1, wherein the multimedia data includes at least two of audio data, video data, and text data.

3. The system of claim 2, wherein the indexers include at least two of: an audio indexer configured to perform speech recognition on the audio data based on the training data, a video indexer configured to perform at least one of video recognition and speech recognition on the video data based on the training data, and a text indexer configured to perform text recognition on the text data based on the training data.

4. The system of claim 1, wherein when receiving user augmentation relating to one of the documents, the memory system is configured to: receive correction of the one of the documents, as a corrected document, from a user, and store the corrected document.

5. The system of claim 4, wherein when providing the user augmentation to one or more of the indexers, the memory system is configured to send the corrected document to the one or more of the indexers for retraining based on the corrected document.

6. The system of claim 5, wherein the one or more of the indexers are configured to: add the corrected document to the training data, and retrain based on the training data.

7. The system of claim 1, wherein when receiving user augmentation relating to one of the documents, the memory system is configured to: receive enhancement of the one of the documents, as an enhanced document, from a user, and store the enhanced document.

8. The system of claim 7, wherein when providing the user augmentation to one or more of the indexers, the memory system is configured to send the enhanced document to the one or more of the indexers for retraining based on the enhanced document.

9. The system of claim 8, wherein the one or more of the indexers are configured to: add the enhanced document to the training data, and retrain based on the training data.

10. The system of claim 1, wherein when receiving user augmentation relating to one of the documents, the memory system is configured to: receive annotation of the one of the documents, as an annotated document, from a user, and store the annotated document.

11. The system of claim 10, wherein when providing the user augmentation to one or more of the indexers, the memory system is configured to send the annotated document to the one or more of the indexers for retraining based on the annotated document.

12. The system of claim 11, wherein the one or more of the indexers are configured to: add the annotated document to the training data, and retrain based on the training data.

13. The system of claim 10, wherein when receiving annotation of the one of the documents, the memory system is configured to at least one of: receive a bookmark relating to the one of the documents, receive highlighting regarding one or more portions of the one of the documents, and receive a note relating to at least a portion of the one of the documents.

14. The system of claim 13, wherein the note includes one of comments from the user, one of an audio, video, and text file, and a reference to another one of the documents.

15. The system of claim 1, wherein when receiving user augmentation relating to one of the documents, the memory system is configured to: receive an attachment for the one of the documents from a user, and store the attachment.

16. The system of claim 15, wherein when providing the user augmentation to one or more of the indexers, the memory system is configured to send the attachment to the one or more of the indexers for retraining based on the attachment.

17. The system of claim 16, wherein the one or more of the indexers are configured to: add the attachment to the training data, and retrain based on the training data.

18. The system of claim 15, wherein the attachment includes one of an audio document, a video document, and a text document, or a reference to the audio document, the video document, or the text document.

19. The system of claim 15, wherein the memory system is further configured to: send the attachment for analysis by one or more of the indexers.

20. A multimedia recognition system, comprising: means for receiving a plurality of types of multimedia data; means for recognizing the multimedia data based on training data to generate recognition results; means for storing the recognition results; means for receiving user augmentation relating to some of the recognition results; means for adding the user augmentation to the training data to obtain new training data; and means for retraining based on the new training data.

21. A method for improving recognition results, comprising: receiving multimedia data; recognizing the multimedia data based on training data to generate a plurality of documents; receiving user augmentation relating to one of the documents; supplementing the training data with the user augmentation to obtain supplemented training data; and retraining based on the supplemented training data.

22. The method of claim 21, wherein the multimedia data includes at least two of audio data, video data, and text data.

23. The method of claim 22, wherein the recognizing the multimedia data includes at least two of: performing speech recognition on the audio data based on the training data, performing at least one of video recognition and speech recognition on the video data based on the training data, and performing text recognition on the text data based on the training data.

24. The method of claim 21, wherein the receiving user augmentation relating to one of the documents includes: receiving correction of the one of the documents, as a corrected document, from a user, and storing the corrected document.

25. The method of claim 24, wherein the supplementing the training data includes: adding the corrected document to the training data.

26. The method of claim 21, wherein the receiving user augmentation relating to one of the documents includes: receiving enhancement of the one of the documents, as an enhanced document, from a user, and storing the enhanced document.

27. The method of claim 26, wherein the supplementing the training data includes: adding the enhanced document to the training data.

28. The method of claim 21, wherein the receiving user augmentation relating to one of the documents includes: receiving annotation of the one of the documents, as an annotated document, from a user, and storing the annotated document.

29. The method of claim 28, wherein the supplementing the training data includes: adding the annotated document to the training data.

30. The method of claim 28, wherein the receiving annotation of the one of the documents includes at least one of: receiving a bookmark relating to the one of the documents, receiving highlighting regarding one or more portions of the one of the documents, and receiving a note relating to at least a portion of the one of the documents.

31. The method of claim 30, wherein the note includes one of comments from the user, one of an audio, video, and text file, and a reference to another one of the documents.

32. The method of claim 21, wherein the receiving user augmentation relating to one of the documents includes: receiving an attachment for the one of the documents from a user, and storing the attachment.

33. The method of claim 32, wherein the supplementing the training data includes: adding the attachment to the training data.

34. The method of claim 32, wherein the attachment includes one of an audio document, a video document, and a text document, or a reference to the audio document, the video document, or the text document.

35. The method of claim 23, further comprising: performing at least one of speech recognition, video recognition, and text recognition on the attachment.

36. A computer-readable medium that stores instructions executable by one or more processors for improving recognition of multimedia data, comprising: instructions for acquiring multimedia data; instructions for recognizing the multimedia data based on training data to generate a plurality of documents; instructions for obtaining user augmentation relating to one of the documents; instructions for adding the user augmentation to the training data to obtain new training data; and instructions for retraining based on the new training data.

37. A multimedia recognition system, comprising: a plurality of indexers configured to: receive multimedia data, and analyze the multimedia data based on training data to generate a plurality of documents; and a memory system configured to: store the documents from the indexers, obtain new documents, store the new documents, and provide the new documents to one or more of the indexers for retraining based on the new documents.

38. The system of claim 37, wherein the multimedia data includes at least two of audio data, video data, and text data.

39. The system of claim 38, wherein the indexers include at least two of: an audio indexer configured to perform speech recognition on the audio data based on the training data, a video indexer configured to perform at least one of video recognition and speech recognition on the video data based on the training data, and a text indexer configured to perform text recognition on the text data based on the training data.

40. The system of claim 37, wherein when obtaining one of the new documents, the memory system is configured to at least one of: receive text that has been cut-and-pasted, receive a file containing the one of the new documents, and receive a link to the one of the new documents.

41. The system of claim 37, wherein when obtaining the new documents, the memory system is configured to: employ an agent to actively seek out and retrieve new documents.

42. The system of claim 37, wherein the one or more of the indexers are configured to: add the new documents to the training data, and retrain based on the training data.

43. A multimedia recognition system, comprising: means for receiving a plurality of types of multimedia data; means for recognizing the multimedia data based on training data to generate recognition results; means for obtaining new documents from one or more users; means for adding the new documents to the training data to obtain new training data; and means for retraining based on the new training data.

44. A method for improving recognition results, comprising: receiving multimedia data; recognizing the multimedia data based on training data to generate a plurality of documents; obtaining new documents; supplementing the training data with the new documents to obtain supplemented training data; and retraining based on the supplemented training data.

45. The method of claim 44, wherein the multimedia data includes at least two of audio data, video data, and text data.

46. The method of claim 45, wherein the recognizing the multimedia data includes at least two of: performing speech recognition on the audio data based on the training data, performing at least one of video recognition and speech recognition on the video data based on the training data, and performing text recognition on the text data based on the training data.

47. The method of claim 44, wherein the obtaining new documents includes at least one of: receiving text that has been cut-and-pasted, receiving one or more files containing the new documents, and receiving one or more links to the new documents.

48. The method of claim 44, wherein the obtaining the new documents includes: actively seeking out and retrieving the new documents.